Unsupervised Interpretable Basis Extraction for Concept-Based Visual Explanations
An important line of research attempts to explain CNN image classifier
predictions and intermediate-layer representations in terms of
human-understandable concepts. In this work, we build on previous works in the
literature that use annotated concept datasets to extract interpretable
feature-space directions, and we propose an unsupervised post-hoc method to
extract a disentangling interpretable basis by searching for the rotation of
the feature space under which thresholded pixel-activation representations
become sparse and one-hot. We experiment with existing popular CNNs and
demonstrate the effectiveness of our method in extracting an interpretable
basis across network architectures and training datasets. We extend the
existing basis interpretability metrics found in the literature and show that
intermediate-layer representations become more interpretable when transformed
to the bases extracted with our method. Finally, using the basis
interpretability metrics, we compare the bases extracted with our method
against those derived with a supervised approach; we find that, in one aspect,
the proposed unsupervised approach has a strength that constitutes a
limitation of the supervised one, and we give potential directions for future
research.
Comment: 15 pages. Accepted in IEEE Transactions on Artificial Intelligence,
Special Issue on New Developments in Explainable and Interpretable Artificial
Intelligence.
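The core objective the abstract describes — finding a feature-space rotation under which thresholded activations become sparse and one-hot — can be illustrated with a toy sketch. Everything below is my own construction, not the paper's optimizer: synthetic "pixel activations", rotations parameterized via the Cayley transform, and a naive random search standing in for whatever optimization the authors actually use.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d, scale=1.0):
    # Cayley transform of a random skew-symmetric matrix -> orthogonal matrix
    A = rng.normal(scale=scale, size=(d, d))
    S = A - A.T
    I = np.eye(d)
    return np.linalg.solve(I + S, I - S)

def one_hot_score(F, R, thresh=0.5):
    # Rotate the features, threshold them, and score how close each
    # transformed representation is to a one-hot vector.
    Z = F @ R.T
    B = (Z > thresh).astype(float)
    return float(np.mean(B.sum(axis=1) == 1))

# Toy activations: one-hot "concepts" mixed by an unknown rotation plus noise.
d, n = 4, 500
M = random_rotation(d)
concepts = np.eye(d)[rng.integers(0, d, n)]
F = concepts @ M.T + 0.05 * rng.normal(size=(n, d))

# Naive random search over rotations for the most "disentangling" basis.
best_R, best_s = np.eye(d), one_hot_score(F, np.eye(d))
for _ in range(2000):
    R = random_rotation(d)
    s = one_hot_score(F, R)
    if s > best_s:
        best_R, best_s = R, s

print(f"identity basis score: {one_hot_score(F, np.eye(d)):.2f}")
print(f"best rotation score:  {best_s:.2f}")
```

The sketch only conveys the objective; in practice one would optimize the rotation with gradient-based methods on an orthogonality-preserving parameterization rather than random search.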
DeepMoCap: Deep Optical Motion Capture Using Multiple Depth Sensors and Retro-Reflectors
In this paper, a marker-based, single-person optical motion capture method (DeepMoCap) is proposed using multiple spatio-temporally aligned infrared-depth sensors and retro-reflective straps and patches (reflectors). DeepMoCap explores motion capture by automatically localizing and labeling reflectors on depth images and, subsequently, in 3D space. Introducing a non-parametric representation to encode the temporal correlation among pairs of colorized depthmaps and 3D optical flow frames, a multi-stage Fully Convolutional Network (FCN) architecture is proposed to jointly learn reflector locations and their temporal dependency among sequential frames. The extracted 2D reflector locations are spatially mapped into 3D space, resulting in robust 3D optical data extraction. The subject's motion is efficiently captured by applying a template-based fitting technique to the extracted optical data. Two datasets have been created and made publicly available for evaluation purposes: one comprising multi-view depth and 3D optical flow annotated images (DMC2.5D), and a second consisting of spatio-temporally aligned multi-view depth images along with skeleton, inertial, and ground-truth MoCap data (DMC3D). The FCN model outperforms its competitors on the DMC2.5D dataset using the 2D Percentage of Correct Keypoints (PCK) metric, while the motion capture outcome is evaluated against RGB-D and inertial data fusion approaches on DMC3D, outperforming the next best method by 4.5% in total 3D PCK accuracy.
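PCK, the evaluation metric named in the abstract, counts a predicted keypoint as correct when it falls within a distance threshold of the ground truth. A minimal 2D version (the coordinates and threshold below are my own toy values, not from the paper) might look like:

```python
import numpy as np

def pck(pred, gt, threshold):
    """Percentage of Correct Keypoints: a prediction counts as correct
    if it lies within `threshold` distance of its ground-truth keypoint."""
    dists = np.linalg.norm(pred - gt, axis=-1)
    return float(np.mean(dists <= threshold))

gt   = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
pred = np.array([[0.1, 0.0], [1.0, 1.4], [2.0, 0.05]])
print(pck(pred, gt, threshold=0.2))  # 2 of 3 keypoints within 0.2 units
```

In papers the threshold is typically normalized, e.g. to a fraction of the person's bounding-box or torso size, rather than a fixed pixel distance.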
Markerless structure-based multi-sensor calibration for free viewpoint video capture
Free-viewpoint capture technologies have recently started demonstrating impressive results. Being able to capture human performances in full 3D is a very promising technology for a variety of applications. However, the setup of the capturing infrastructure is usually expensive and requires trained personnel. In this work we focus on one practical aspect of setting up a free-viewpoint capturing system: the spatial alignment of the sensors. Our work aims at simplifying the external calibration process, which typically requires significant human intervention and technical knowledge. Our method uses an easy-to-assemble structure and, unlike similar works, does not rely on markers or features. Instead, we exploit the a-priori knowledge of the structure's geometry to establish correspondences for the minimally overlapping viewpoints typically found in free-viewpoint capture setups. These correspondences establish an initial sparse alignment that is then densely optimized. At the same time, our pipeline improves robustness to assembly errors, allowing non-technical users to calibrate multi-sensor setups. Our results showcase the feasibility of our approach, which makes the tedious calibration process easier and less error-prone.
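The abstract does not give implementation details, but a standard building block for turning sparse 3D correspondences into an initial rigid alignment between two sensors is the closed-form Kabsch/Procrustes fit. The sketch below is a generic version of that step, not the authors' code:

```python
import numpy as np

def rigid_align(src, dst):
    """Kabsch/Procrustes: least-squares rotation R and translation t
    such that R @ src[i] + t ~= dst[i], for N x 3 point sets."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    # Correct for a possible reflection so the result is a proper rotation.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = mu_d - R @ mu_s
    return R, t

# Sanity check: recover a known rotation + translation from noiseless points.
rng = np.random.default_rng(1)
src = rng.normal(size=(10, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.5, -1.0, 2.0])
dst = src @ R_true.T + t_true

R, t = rigid_align(src, dst)
print(np.allclose(R, R_true), np.allclose(t, t_true))
```

In a real multi-sensor pipeline, a fit like this would only seed the dense optimization stage the abstract mentions, which refines the poses jointly across all sensors.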